Introduction

This tutorial provides a brief introduction to mapping linguistic data using R. That means we’ll be working with regional data and we’ll want to map features to help us understand the regional distribution of language varieties. This is useful for dialectology, but similar approaches are also used in dialectometry and NLP.

First Steps

In preparation for our maps, we’ll need to load a couple of packages. For mapping we’ll need ‘maps’, and for some optional maps ‘rworldmap’, which also loads ‘sp’. We’ll also use ‘sf’ to convert the geo-information into a format we can join to our data. The main work for the maps will be done using ggplot2, so we need the tidyverse package as well. Depending on your version of R, you might also need ‘broom’.

library(maps) # to get US maps
library(rworldmap) # mapping other country outlines
library(tidyverse) # making pretty maps
library(sf) # to change the geo-information to suitable format

Data

The data we’ll be using is based on a collection of 1 billion Tweets (roughly 9 billion words). All Tweets are geocoded American Tweets collected between 2013 and 2014, from which a US Twitter swearing dataset was compiled. See Huang et al. (2016) and Grieve et al. (2017) for more information.

The initial step now is reading in the dataset.

norm_swear <- read.table("BSLSS_SWEAR.txt", header = TRUE, sep = ",")

The basic dimension of the dataset is 52 swear words measured across 3,085 locations; together with the location column (state plus county), that gives 53 columns.

dim(norm_swear)
## [1] 3085   53

The locations are coded as state-county pairs. These are the first 15 rows of our dataset.

head(norm_swear, 15)
##              county     ass asshole bastard   bitch bitched bitchy bloody
## 1   alabama,autauga 1520421   49600    9538  962106    6995   8903   5087
## 2   alabama,baldwin 1246775   54318    6578  807348    2334   7851  14004
## 3   alabama,barbour 2263661   29188    3243  959948    3243   6486   3243
## 4      alabama,bibb 1451192   14629    2926 1009398       0   8777      0
## 5    alabama,blount  559433   72969    4230  506556    2115   5288   3173
## 6   alabama,bullock 2168413   56605       0 1184354       0   8708      0
## 7    alabama,butler 2638306   38680   11282 1806683    6447   4835   3223
## 8   alabama,calhoun 1604872   38763    8012  917534    2166   5197   4115
## 9  alabama,chambers 1881425   34756    5902 1438120    1312   1967  20329
## 10 alabama,cherokee  380377   37028    1683  272660    1683   6732   5049
## 11  alabama,chilton 1202164   58668    6400 1122162    5333   6400   6400
## 12  alabama,choctaw 1398501   26458       0  653894       0   7559   3780
## 13   alabama,clarke 1405435   50611    7786  809780       0   7786      0
## 14     alabama,clay 1322939   35436    7087  668557       0  14174   2362
## 15 alabama,cleburne  629833   49888    6236  433400       0  18708  15590
##    bullshit  cock   crap crappy  cunt    damn damnit damned  darn   dick
## 1    120184 15897 146255  13354 22892 1206925  19077   8903 13990 210481
## 2     98452  7002 109910  10397  9124  907073   9760  10185 10185 113729
## 3     74591  3243 113507  19458  3243 1258310      0  25945 22701 136209
## 4    105328     0  90699   8777  2926 1176168  17555   2926 17555  93625
## 5    101523  9518 201988   6345  9518  469543  13748  23266  8460  59222
## 6    182878  8708  21771   4354     0 1240960      0      0  4354 300443
## 7    164390  9670  46738   4835     0 1513359   3223  11282 14505 267537
## 8     95500  8446  74711   6713 10395 1027543  14292  14292 12344 147689
## 9    155419  7869  55741   3935  8525 1080065  13116   8525 10492 172469
## 10    40394  1683 121182  15148  3366  336617   5049  11782  6732  38711
## 11   129070  5333 120536  13867 13867  802154   9600  22401 11734 120536
## 12    56696     0  22678   3780     0  578299      0      0 15119  49137
## 13    66184  7786 147941   3893  3893 1140699      0   3893 31145  89543
## 14    96858  4725 203166  14174  4725 1030002  11812  14174 21262 153555
## 15   112247 12472 159017  15590  9354  654777  24944  21826 21826 102893
##    dickhead douche douchebag dumbass  dyke   fag faggot fatass freaking friggin
## 1      3179  14626      6359   43241  2544 42605  40697   6359   167876    2544
## 2      2971  18884      5729   29069  2546 19521  15489   4031   170593    2546
## 3         0   3243      3243   22701     0  9729      0      0   175126    3243
## 4      2926   5852         0   11703  2926  2926   8777      0   187251    8777
## 5      9518  13748      4230   25381     0 16920  32783   4230   195643    5288
## 6         0   4354         0   52251 17417 13063  52251      0    47897       0
## 7         0   4835      1612   29010     0  6447  12893   1612    78972    1612
## 8       650   7363      2815   18840  3898 11044   9312   3032   104378    2599
## 9      1967    656         0   26231     0  9181   9837   8525   127221    1312
## 10     1683   6732      1683   15148     0  1683  10099      0   107717   13465
## 11     1067  16000      3200   27734  3200 11734  13867      0   151471    2133
## 12        0      0         0    3780     0     0      0      0    64255   11339
## 13        0   3893         0   58398  3893  3893   7786      0   179086    7786
## 14        0   7087     14174   25986  2362  9450  14174   2362   179542    2362
## 15        0  12472      3118   12472  3118 18708   6236   6236   149663    9354
##       fuck fucked fucker fuckery fucking goddamn   gosh   hell    hoe  homo
## 1  1441570 212388  21620    6359  592017    5087  69948 695667 268347 17169
## 2  1137714 139191  15065    2546  462767    5941  81690 573101 252920  7426
## 3  1115615 158910  12972    3243  376196    9729  64861 901573 369710  3243
## 4  1351715 236989  20481   17555  833850    8777  38035 506162 298431  2926
## 5   775168  88832  12690    3173  319374    4230 132191 379653 131134  1058
## 6  1941993 278672  21771    8708  574760   13063  30480 735867 335277     0
## 7  2427177 328781  17728   14505  676902   27398  48350 862244 515735  1612
## 8  1305163 188184  12127    8879  401056    7363  52189 747323 293645  6063
## 9  1800109 220997  17706    5246  445273   15739  58364 908252 398713  3935
## 10  464532  50493  33662    5049  323152    1683  42077 373645  69006  5049
## 11 1123229 161071  24534    7467  652817    2133  97069 489613 378676  1067
## 12 1190616 200326   7559    7559  302379    7559  18899 585859 238123  7559
## 13  774741 124581   7786    3893  288095   19466  81757 654053 140154     0
## 14 1256792 148831   4725    7087  344909    2362  49610 850461 203166  4725
## 15  645423  90422   9354       0  280619    3118  71714 545647 109129     0
##    jackass motherfucker motherfucking nigger   piss pissed pissy  pussy    shit
## 1    13354        12082          3179   5087  69948 169148 11446 197763 2352169
## 2     4668         4456          2546   2971  71081 152770  4456 103969 1733094
## 3        0        19458          3243   3243  64861 139452  6486 204313 2085293
## 4        0         5852          5852      0  67293 201880     0 152141 2390371
## 5     2115         7403          3173      0 102580 155457 10575  32783  905244
## 6     4354        30480             0  13063  65314 117565  4354 296089 3239557
## 7        0         9670          4835      0  78972 262702 11282 269149 3932477
## 8     1083         7579          7146   3032  57170 159816  4331 161332 2190864
## 9     3279        21641         11148   1967  53774 172469  2623 255097 2863124
## 10   10099            0             0      0  42077  72373  1683  26929  540270
## 11    7467         3200          3200   2133  87469 147204 13867  76802 1930716
## 12       0            0          3780      0  30238 143630     0 105833 2260280
## 13   19466         7786             0      0 105116 147941     0  70077 1666277
## 14       0         9450             0      0  61422 170092  2362 148831 2093078
## 15    3118         3118          3118      0  77950  96658  6236  34298 1172362
##    shittiest shitty  slut slutty twat whore
## 1       1908  52779 37518   3179 3179 46420
## 2       3395  41163 31615   5305 3819 32676
## 3       9729  22701 32431      0    0 25945
## 4          0  20481 29258      0    0 29258
## 5       2115  43359 35956   4230 2115 45474
## 6       4354  65314 21771   4354    0 13063
## 7          0  35457 19340   6447 4835 24175
## 8        866  25337 19273   3248 1083 20573
## 9        656   8525 16394   1967 1312 28198
## 10      1683  18514 10099   1683 6732 38711
## 11         0  43734 28801   7467 2133 55468
## 12         0   3780 11339      0    0 11339
## 13         0  15573 23359   3893    0 50611
## 14         0  16537 14174   4725 2362  7087
## 15      3118   9354  9354   6236 3118 40534

For each county, Grieve et al. (2017) measured the relative frequency per billion words of each word in all the Tweets originating from that county: the frequency of the word in those Tweets is divided by the total number of words in those Tweets, and the result is multiplied by 1 billion. These swear words are all among the top 10,000 most frequent word types in the corpus. Here is a summary of the swear words.

summary(norm_swear[, 2:ncol(norm_swear)])
##       ass             asshole          bastard           bitch        
##  Min.   :      0   Min.   :     0   Min.   :     0   Min.   :      0  
##  1st Qu.: 633706   1st Qu.: 42612   1st Qu.:  4940   1st Qu.: 522422  
##  Median : 861821   Median : 63864   Median :  9371   Median : 727234  
##  Mean   :1017433   Mean   : 67841   Mean   : 11091   Mean   : 790277  
##  3rd Qu.:1266972   3rd Qu.: 86219   3rd Qu.: 13983   3rd Qu.: 997230  
##  Max.   :8904228   Max.   :567215   Max.   :310376   Max.   :7340226  
##     bitched           bitchy           bloody          bullshit     
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:     0   1st Qu.:  2679   1st Qu.:  3599   1st Qu.: 84137  
##  Median :  3385   Median :  6801   Median :  8789   Median :111607  
##  Mean   :  4892   Mean   :  8674   Mean   : 11888   Mean   :113840  
##  3rd Qu.:  6305   3rd Qu.: 11120   3rd Qu.: 14831   3rd Qu.:139169  
##  Max.   :508411   Max.   :283607   Max.   :591876   Max.   :714967  
##       cock              crap            crappy            cunt       
##  Min.   :      0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:   5282   1st Qu.: 60791   1st Qu.:  4989   1st Qu.:  7375  
##  Median :  11404   Median : 88122   Median :  9890   Median : 17099  
##  Mean   :  14377   Mean   : 98436   Mean   : 11988   Mean   : 21012  
##  3rd Qu.:  17611   3rd Qu.:124043   3rd Qu.: 14915   3rd Qu.: 28555  
##  Max.   :1242999   Max.   :821355   Max.   :244499   Max.   :435954  
##       damn             damnit           damned            darn       
##  Min.   :      0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.: 578299   1st Qu.:  6920   1st Qu.:  4672   1st Qu.:  8763  
##  Median : 742309   Median : 14030   Median :  8998   Median : 13989  
##  Mean   : 794217   Mean   : 17003   Mean   : 11526   Mean   : 17009  
##  3rd Qu.: 944342   3rd Qu.: 22559   3rd Qu.: 14355   3rd Qu.: 20566  
##  Max.   :3846951   Max.   :368363   Max.   :235349   Max.   :263481  
##       dick            dickhead         douche         douchebag     
##  Min.   :      0   Min.   :    0   Min.   :     0   Min.   :     0  
##  1st Qu.: 106690   1st Qu.:    0   1st Qu.: 10956   1st Qu.:     0  
##  Median : 152438   Median :  837   Median : 20780   Median :  4606  
##  Mean   : 158934   Mean   : 2405   Mean   : 28157   Mean   :  7044  
##  3rd Qu.: 199728   3rd Qu.: 2874   3rd Qu.: 37733   3rd Qu.:  9234  
##  Max.   :1426300   Max.   :65772   Max.   :357483   Max.   :275330  
##     dumbass            dyke             fag             faggot      
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.: 16771   1st Qu.:     0   1st Qu.:  9901   1st Qu.:  9834  
##  Median : 25801   Median :  1280   Median : 18972   Median : 20574  
##  Mean   : 28587   Mean   :  2748   Mean   : 22735   Mean   : 25157  
##  3rd Qu.: 35497   3rd Qu.:  3747   3rd Qu.: 29998   3rd Qu.: 34166  
##  Max.   :301841   Max.   :133627   Max.   :301341   Max.   :308339  
##      fatass          freaking         friggin            fuck        
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :      0  
##  1st Qu.:     0   1st Qu.: 83361   1st Qu.:     0   1st Qu.: 982852  
##  Median :  2750   Median :118324   Median :  3148   Median :1392858  
##  Mean   :  3992   Mean   :129845   Mean   :  5194   Mean   :1429525  
##  3rd Qu.:  5219   3rd Qu.:161718   3rd Qu.:  6041   3rd Qu.:1835999  
##  Max.   :157093   Max.   :900328   Max.   :307630   Max.   :9527509  
##      fucked            fucker          fuckery          fucking       
##  Min.   :      0   Min.   :     0   Min.   :     0   Min.   :      0  
##  1st Qu.: 116560   1st Qu.: 12704   1st Qu.:     0   1st Qu.: 498194  
##  Median : 173033   Median : 21614   Median :  1763   Median : 724606  
##  Mean   : 177267   Mean   : 25471   Mean   :  3784   Mean   : 771294  
##  3rd Qu.: 232068   3rd Qu.: 32636   3rd Qu.:  5178   3rd Qu.: 991071  
##  Max.   :1133503   Max.   :261505   Max.   :277937   Max.   :4075971  
##     goddamn            gosh              hell              hoe         
##  Min.   :     0   Min.   :      0   Min.   :      0   Min.   :      0  
##  1st Qu.:  2484   1st Qu.:  47709   1st Qu.: 406799   1st Qu.:  64499  
##  Median : 10121   Median :  72283   Median : 498316   Median : 110292  
##  Mean   : 12654   Mean   :  82519   Mean   : 531068   Mean   : 155838  
##  3rd Qu.: 17097   3rd Qu.: 103681   3rd Qu.: 613539   3rd Qu.: 200547  
##  Max.   :231535   Max.   :2601908   Max.   :2770083   Max.   :1949566  
##       homo           jackass        motherfucker    motherfucking   
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:     0   1st Qu.:     0   1st Qu.:  4362   1st Qu.:     0  
##  Median :  6559   Median :  4370   Median : 10143   Median :  3647  
##  Mean   :  7817   Mean   :  5617   Mean   : 11714   Mean   :  4903  
##  3rd Qu.: 10300   3rd Qu.:  7205   3rd Qu.: 15371   3rd Qu.:  6449  
##  Max.   :276932   Max.   :154447   Max.   :382482   Max.   :236967  
##      nigger            piss            pissed           pissy       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.:     0   1st Qu.: 55306   1st Qu.:125064   1st Qu.:     0  
##  Median :  3021   Median : 71603   Median :160520   Median :  4475  
##  Mean   :  4693   Mean   : 75461   Mean   :167956   Mean   :  7000  
##  3rd Qu.:  6201   3rd Qu.: 91188   3rd Qu.:204546   3rd Qu.:  8865  
##  Max.   :295300   Max.   :475602   Max.   :747938   Max.   :293600  
##      pussy              shit            shittiest         shitty      
##  Min.   :      0   Min.   :       0   Min.   :    0   Min.   :     0  
##  1st Qu.:  59076   1st Qu.: 1207801   1st Qu.:    0   1st Qu.: 41936  
##  Median :  98155   Median : 1608277   Median : 2902   Median : 70336  
##  Mean   : 121016   Mean   : 1753821   Mean   : 4029   Mean   : 76185  
##  3rd Qu.: 154377   3rd Qu.: 2174110   3rd Qu.: 5868   3rd Qu.:102097  
##  Max.   :1488628   Max.   :12309084   Max.   :69425   Max.   :550661  
##       slut            slutty            twat            whore       
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
##  1st Qu.: 24019   1st Qu.:     0   1st Qu.:     0   1st Qu.: 25511  
##  Median : 38600   Median :  4738   Median :  3116   Median : 37334  
##  Mean   : 43288   Mean   :  5727   Mean   :  4634   Mean   : 41494  
##  3rd Qu.: 55727   3rd Qu.:  7935   3rd Qu.:  6107   3rd Qu.: 52540  
##  Max.   :547945   Max.   :150670   Max.   :130014   Max.   :547945
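The per-billion normalisation behind these figures is simple arithmetic; here is a quick sanity check with made-up counts:

```r
# Hypothetical county: 38 tokens of a swear word out of 25 million words
word_count  <- 38
total_words <- 25e6

# Relative frequency per billion words: divide by the county word total,
# then scale the result up to 1 billion
rel_freq <- word_count / total_words * 1e9
rel_freq
## [1] 1520
```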

Mapping

Before we map our swearing data, we need to understand the basics of cartography in R.

Mapping the US

First, we need to get a map of the US, which we will format and use as a base on which to plot our swear word relative frequencies. There are several stages to setting up a nice map; aside from the first step, which just involves reading in the underlying map, they’re all optional.

Accessing Mapping Data in R

First, we need to get a US map. Fortunately, working with US data is very easy in R, since all the necessary maps can be accessed via library(maps). We use ggplot2’s map_data() function to extract the relevant information from the package.

usa <- map_data("usa")

Now we’ll have a look at the very basic US map. For this we’ll need ggplot:

ggplot() + 
  geom_polygon(data = usa, 
               aes(x = long, y = lat, group = region))

Other countries

If you want to map other countries, you can download and read in the base mapping data (e.g. shapefiles), which are available from various sources. This is especially interesting if you’re looking to work with administrative regions and the like. For country outlines, you can also use library(rworldmap). The example below shows how to produce a map of Germany, Austria and Switzerland (= German-Speaking Area, GSA). rworldmap works with coordinates.

This code chunk first gets the world map and creates a vector of three country names. We then find those countries in the map object and, in the next step, extract their coordinates so that we can use them for mapping.

worldMap <- getMap()
GSA <- c("Germany", "Austria", "Switzerland")
GSA_map <- which(worldMap$NAME %in% GSA)
GSA_coord <- lapply(GSA_map, function(i){
  df <- data.frame(worldMap@polygons[[i]]@Polygons[[1]]@coords)
  df$region = as.character(worldMap$NAME[i])
  colnames(df) <- list("long", "lat", "region")
  return(df)
})
GSA_coord <- do.call("rbind", GSA_coord)

After this, we can have a look at our three countries using ggplot. The coord_fixed argument makes sure that the relationship between x and y is correct; it fixes the aspect ratio.

gsa_map <- ggplot() + 
  geom_polygon(data = GSA_coord, 
               aes(x = long, y = lat, group = region)) +
  # this bit does the aspect ratio fix
  coord_fixed(1.3) 
gsa_map

Back to the US

Now, we need to make sure our US data can be mapped, which means we don’t just need the outline of the US, but we need the counties. We can extract them from our maps package.

counties <- map_data("county")
ggplot() + 
  geom_polygon(data = counties, 
               aes(x = long, y = lat, group = group),
               # to see the counties we add a colour for outline and filling
               color = "black", fill = "lightgrey", 
               size = .1) +
  coord_fixed(1.3)

Polishing our map

Now that we have a basic map of the US, we can make it look a bit nicer, so that subsequent maps are easier to read.

ggplot() + 
  geom_polygon(data = counties, 
               aes(x = long, y = lat, group = group),
               color = "black", fill = "white", 
               size = .1) +
  coord_fixed(1.3) +
  theme_minimal() +  # sets the theme for the plot
  ggtitle("US Map with Counties") + # gives the plot a title
  theme(axis.title.x = element_blank(), # removes x axis title, here longitude
        axis.title.y = element_blank(),# removes y axis title, here latitude
        axis.text.x = element_blank(), # removes x axis text, here coordinates
        axis.text.y = element_blank(), # removes y axis text, here coordinates
        panel.grid.major = element_blank(), # removes grid lines
        panel.grid.minor = element_blank(), # removes grid lines
        plot.title = element_text(hjust = 0.5)) # centres title
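Since we’ll be reusing this block of theme settings for every map below, one optional convenience (not part of the original workflow) is to store it once and add it like any other layer. The name map_theme is our own choice:

```r
library(ggplot2)

# Bundle the recurring theme settings so each map only needs "+ map_theme"
map_theme <- theme_minimal() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.title = element_text(hjust = 0.5))
```

Each of the maps below could then end in `+ ggtitle(...) + map_theme` instead of repeating the whole theme() block.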

Data Wrangling

Now that we have a base map and our data read in, we need to make sure the data can be mapped. This might look a bit complicated, but all we’re doing is getting the coordinate data we need and joining it to our existing dataset.

First, we get a map of the counties (i.e. the geo-information we need) and save it as us_geo (and have a little look, colourful!). For this we need the package ‘sf’. We’re still using the same “maps” library as before, but since each county has multiple sets of coordinates, we need a format that can be matched to our dataset, where each location is just one row; hence we handle it with ‘sf’. We then merge the two datasets into one using dplyr.

us_geo <- st_as_sf(maps::map(database = "county", 
                             plot = FALSE, 
                             fill = TRUE))
plot(us_geo)

us_geo_swear <- us_geo %>%
  left_join(norm_swear, 
            by = c("ID" = "county"))

If you have a look at the new data frame us_geo_swear, you can see that it is essentially the same table as before, but the last column now contains a list of geometries, as every county has multiple coordinate points, which we need for plotting.

# shows us that it is a data frame
class(us_geo_swear)
## [1] "sf"         "data.frame"
# you can see that we now have a data frame that contains multipolygons
head(us_geo_swear) 
## Simple feature collection with 6 features and 53 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686
## Geodetic CRS:  WGS 84
##                ID     ass asshole bastard   bitch bitched bitchy bloody
## 1 alabama,autauga 1520421   49600    9538  962106    6995   8903   5087
## 2 alabama,baldwin 1246775   54318    6578  807348    2334   7851  14004
## 3 alabama,barbour 2263661   29188    3243  959948    3243   6486   3243
## 4    alabama,bibb 1451192   14629    2926 1009398       0   8777      0
## 5  alabama,blount  559433   72969    4230  506556    2115   5288   3173
## 6 alabama,bullock 2168413   56605       0 1184354       0   8708      0
##   bullshit  cock   crap crappy  cunt    damn damnit damned  darn   dick
## 1   120184 15897 146255  13354 22892 1206925  19077   8903 13990 210481
## 2    98452  7002 109910  10397  9124  907073   9760  10185 10185 113729
## 3    74591  3243 113507  19458  3243 1258310      0  25945 22701 136209
## 4   105328     0  90699   8777  2926 1176168  17555   2926 17555  93625
## 5   101523  9518 201988   6345  9518  469543  13748  23266  8460  59222
## 6   182878  8708  21771   4354     0 1240960      0      0  4354 300443
##   dickhead douche douchebag dumbass  dyke   fag faggot fatass freaking friggin
## 1     3179  14626      6359   43241  2544 42605  40697   6359   167876    2544
## 2     2971  18884      5729   29069  2546 19521  15489   4031   170593    2546
## 3        0   3243      3243   22701     0  9729      0      0   175126    3243
## 4     2926   5852         0   11703  2926  2926   8777      0   187251    8777
## 5     9518  13748      4230   25381     0 16920  32783   4230   195643    5288
## 6        0   4354         0   52251 17417 13063  52251      0    47897       0
##      fuck fucked fucker fuckery fucking goddamn   gosh   hell    hoe  homo
## 1 1441570 212388  21620    6359  592017    5087  69948 695667 268347 17169
## 2 1137714 139191  15065    2546  462767    5941  81690 573101 252920  7426
## 3 1115615 158910  12972    3243  376196    9729  64861 901573 369710  3243
## 4 1351715 236989  20481   17555  833850    8777  38035 506162 298431  2926
## 5  775168  88832  12690    3173  319374    4230 132191 379653 131134  1058
## 6 1941993 278672  21771    8708  574760   13063  30480 735867 335277     0
##   jackass motherfucker motherfucking nigger   piss pissed pissy  pussy    shit
## 1   13354        12082          3179   5087  69948 169148 11446 197763 2352169
## 2    4668         4456          2546   2971  71081 152770  4456 103969 1733094
## 3       0        19458          3243   3243  64861 139452  6486 204313 2085293
## 4       0         5852          5852      0  67293 201880     0 152141 2390371
## 5    2115         7403          3173      0 102580 155457 10575  32783  905244
## 6    4354        30480             0  13063  65314 117565  4354 296089 3239557
##   shittiest shitty  slut slutty twat whore                           geom
## 1      1908  52779 37518   3179 3179 46420 MULTIPOLYGON (((-86.50517 3...
## 2      3395  41163 31615   5305 3819 32676 MULTIPOLYGON (((-87.93757 3...
## 3      9729  22701 32431      0    0 25945 MULTIPOLYGON (((-85.42801 3...
## 4         0  20481 29258      0    0 29258 MULTIPOLYGON (((-87.02083 3...
## 5      2115  43359 35956   4230 2115 45474 MULTIPOLYGON (((-86.9578 33...
## 6      4354  65314 21771   4354    0 13063 MULTIPOLYGON (((-85.66866 3...
# If you open the data frame and scroll to the last column, 
# you can see the list in the list.
view(us_geo_swear) 

Now that the data is prepared, we can try to map some swear words. Note that we’ve switched to geom_sf, which can handle the sf geometry we’ve attached for the geolocation of our swear words. That also means we no longer need geom_polygon, though as the name suggests, the two have similar functionality.

This first map is a very basic choropleth map based on our variable “ass”:

ggplot() +
  geom_sf(data = us_geo_swear, 
          aes(fill = ass)) 

Let’s add our design to it:

ggplot() +
  geom_sf(data = us_geo_swear, 
          aes(fill = ass)) +
  theme_minimal() +  
  ggtitle("'Ass' Distribution in the US per County") + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5)) 

That looks sort of like what we want, so let’s rework it a bit. Note that we divide the occurrences of ‘ass’ by 10,000; since we’re dealing with large numbers, this makes the legend easier to read.

ggplot() +
  geom_sf(data = us_geo_swear, 
          aes(fill = ass / 10000), 
          lwd = 0.1, # lwd sets the outline thickness of the polygons
          color = "grey") + # this sets the outline colour
  theme_minimal() +  
  ggtitle("'Ass' Distribution in the US per County") + 
  # this adds a new legend title with line break \n
  guides(fill = guide_legend(title = "Distribution \nin 10,000")) + 
  # here we start using some nicer colours
  scale_fill_continuous(low = "white", 
                        high = "mediumpurple4") + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8))

We can see that there seems to be a trend towards ‘ass’ in the Southeast. Let’s see if we can find some more trends.

ggplot() +
  geom_sf(data = us_geo_swear, 
          aes(fill = dickhead / 10000), 
          lwd = 0.1, 
          color = "grey") + 
  theme_minimal() +  
  ggtitle("'Dickhead' Distribution in the US per County") + 
  guides(fill = guide_legend(title = "Distribution \nin 10,000")) + 
  scale_fill_continuous(low = "white", 
                        high = "mediumpurple4") + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8))

How about fuck, but in green?

ggplot() +
  geom_sf(data = us_geo_swear, 
          aes(fill = fuck / 10000), 
          lwd = 0.1, 
          color = "grey") + 
  theme_minimal() +  
  ggtitle("'Fuck' Distribution in the US per County") + 
  guides(fill = guide_legend(title = "Distribution \nin 10,000")) + 
  scale_fill_continuous(low = "white", 
                        high = "aquamarine4") + # green this time?
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8))

Quantiles

In the next step for the swearing maps we’ll implement quantiles. That means we split the relative frequency distribution of the word we want to map into intervals. We’re using “quantile”-style intervals here, where the values are split so that each interval contains a roughly equal number of values, although the range of each interval will likely vary (often considerably).

In order to do this, we’ll first pick a swear word and its geometry column and create a new data frame. Then we’ll calculate the quantiles for our swear word and add the resulting intervals as a factor. Exchange the swear word in this code to run it with a different one.

# select the columns you need
quant_swear <- us_geo_swear %>% 
  select(bitch, geom) 
# calculate quantiles
q <- quantile(quant_swear$bitch, 
              na.rm = TRUE) 
# add factor given the quantiles to our list
quant_swear$quant <- factor(findInterval(quant_swear$bitch, q)) 
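To see what these two steps do, here is a small worked example with made-up frequencies. quantile() with its default probabilities returns five breakpoints (the minimum, the three quartiles and the maximum), and findInterval() then labels each value with the interval it falls into:

```r
# Made-up relative frequencies
x <- c(5, 10, 20, 40, 80, 160, 320, 640, 1280)

# Five breakpoints: minimum, lower quartile, median, upper quartile, maximum
q <- quantile(x)
q
##   0%  25%  50%  75% 100% 
##    5   20   80  320 1280

# Interval membership for each value; note that only values equal to
# the maximum end up in interval 5
findInterval(x, q)
## [1] 1 1 2 2 3 3 4 4 5
```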

Now we can map our data. Instead of filling the polygons by the frequency of our swear word, we use the quantile intervals we’ve just defined. Note that this means we’re moving from a continuous colour scale to a discrete one, so we need to change the colouring option of our map. That’s why we first define these colours.

cols <- c("1" = "white", 
          "2" = "lightsteelblue1", 
          "3" = "lightsteelblue2", 
          "4" = "lightsteelblue3", 
          "5" = "lightsteelblue4")
ggplot() +
  # we've added na.omit to not have NAs plotted 
  geom_sf(data = na.omit(quant_swear), 
          aes(fill = quant), 
          lwd = 0.1, 
          color = "grey") + 
  # here we pass our colour list
  scale_colour_manual(values = cols, 
                      #and say we use it to fill
                      aesthetics = c("colour", "fill")) + 
  theme_minimal() +  
  ggtitle("'Bitch' Quantile Distribution in the US") + 
  guides(fill = guide_legend(title = "Quantiles")) + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8))

Let’s map the quantiles of another swear word and change the colours for the map. If you want to play around with colour yourself, this website offers a good overview.

quant_swear <- us_geo_swear %>% select(shit, geom) 
q <- quantile(quant_swear$shit, na.rm = TRUE) 
quant_swear$quant <- factor(findInterval(quant_swear$shit, q)) 
cols <- c("1" = "white", 
          "2" = "rosybrown1", 
          "3" = "rosybrown2", 
          "4" = "rosybrown3", 
          "5" = "rosybrown4")
ggplot() +
  geom_sf(data = na.omit(quant_swear), 
          aes(fill = quant), 
          lwd = 0.1, 
          color = "grey") + 
  scale_colour_manual(values = cols, 
                      aesthetics = c("colour", "fill")) + 
  theme_minimal() +  
  ggtitle("'Shit' Quantile Distribution in the US") + 
  guides(fill = guide_legend(title = "Quantiles")) + 
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(), 
        axis.text.y = element_blank(), 
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        plot.title = element_text(hjust = 0.5),
        legend.title = element_text(size = 8))

Adding cities

As the last bit, we’ll try out adding another layer to our ggplot maps. Remember our map for the German-speaking area.

gsa_map

If we want to add cities to this, for example because we’re interested in data at the city level, we can do so using geom_point. Let’s first create some sample data.

gsa_data <- data.frame(
  City_name = c("Cologne", "Munich", "Vienna", "Bern", "Berlin", "Hamburg", "Kassel", "Graz"), 
  Count1 = c(19, 4, 2, 5, 10, 43, 18, 7), 
  Count2 = c(20, 5, 1, 3, 21, 57, 28, 4),
  Proportion = c(48.72, 44.44, 66.67, 62.5, 32.26, 43.0, 39.13, 63.64),
  Long = c(6.9578, 11.5755, 16.3731, 7.4474, 13.3833, 10, 9.4912, 15.4409),
  Lat = c(50.9422, 48.1372, 48.2083, 46.948, 52.5167, 53.55, 51.3166, 47.0749))
gsa_data
##   City_name Count1 Count2 Proportion    Long     Lat
## 1   Cologne     19     20      48.72  6.9578 50.9422
## 2    Munich      4      5      44.44 11.5755 48.1372
## 3    Vienna      2      1      66.67 16.3731 48.2083
## 4      Bern      5      3      62.50  7.4474 46.9480
## 5    Berlin     10     21      32.26 13.3833 52.5167
## 6   Hamburg     43     57      43.00 10.0000 53.5500
## 7    Kassel     18     28      39.13  9.4912 51.3166
## 8      Graz      7      4      63.64 15.4409 47.0749

Note that we again have a dataset which contains both the linguistic information (here the counts and the proportion) and the geolocation information. With this, we can map the data at the city level.

First, we again use our coordinates to create the basic map of the GSA, just as we did before. Only in the geom_point layer do we add the city data.

ggplot() + 
  geom_polygon(data = GSA_coord, 
               aes(x = long, y = lat, group = region),
               # colour sets the outline, fill sets the filling of the GSA
               colour = "black", 
               size = 0.1, 
               fill = "snow3") + 
  coord_map(xlim = c(4.5, 17),  # this cuts the map to the coordinates we need
            ylim = c(45.5, 55)) + 
  theme_minimal() +  
  geom_point(data = gsa_data, # here we add the cities to our map
             aes(x = Long, y = Lat, col = Proportion, size = (Count1+Count2)), 
             alpha = 0.9)  +
  guides(size = "none") + # hides the size legend; FALSE is deprecated here
  scale_color_gradient(low = "seagreen3", high = "mediumpurple3") +
  ggtitle("Feature 1 vs Feature 2 in the GSA") +
  theme(axis.title.x = element_blank(), 
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        plot.title = element_text(hjust = 0.5))

What this map shows us is the proportion of use of the two features in the given cities. In our made-up data, Germany shows more feature 1 use, whereas Austria and Switzerland tend towards feature 2. The proportion baseline is feature 1. The size of each city point depends on the occurrences of both features combined.
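Since feature 1 is the baseline, the Proportion column can be derived directly from the two counts; a minimal sketch using the Munich, Vienna and Bern values:

```r
# Proportion of feature 1, i.e. Count1 as a percentage of both counts
count1 <- c(4, 2, 5)
count2 <- c(5, 1, 3)
round(count1 / (count1 + count2) * 100, 2)
## [1] 44.44 66.67 62.50
```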

Saving your output

As the last step we want to save our map.

ggsave("germany_map.png", width = 6.5, height = 5.5)